perm filename CHAP6[4,KMC]10 blob sn#051112 filedate 1973-06-28 generic text, type T, neo UTF8
00100	.SEC MODEL VALIDATION
00200	(In collaboration with Franklin Dennis Hilf)
00300	
00400	6.1 SOME EXPERIMENTS
00500	
00600		There are several  meanings  to  the  term  "validate"  which
00700	derive  from  the  Latin VALIDUS= strong. Thus to validate X means to
00800	strengthen it.   In  science  it  usually  means  to  strengthen  X's
00900	acceptability as a hypothesis, theory, or model.  Lurking in the
01000	background there is usually some concept of truth or authenticity.
01100		In  a  purely  instrumentalist  view  theories   are   simply
01200	calculating  or predicting devices for human convenience. They do not
01300	explain and it is unjustified to apply the terms of truth or  falsity
01400	to them. Under a realist view one seeks explanatory truth, that which
01500	really is the case, and hence proposed theories must be evaluated for
01600	their  authenticity.  Since absolute truth cannot be attained we must
01700	settle for degrees of approximation.  To validate, then, is to carry
01800	out  procedures  which  show  to  what degree X, or its consequences,
01900	correspond with facts of  observation.  We  compare  samples  of  the
02000	model's   behavior   with   samples  of  behavior  from  its  natural
02100	counterpart.  The failures should be constructive, yielding new
02200	information.
02300		Since samples of I/O behavior are  being  compared,  one  can
02400	always   question   whether   the  human  sample  is  a  "good"  one,
02500	i.e. representative of the process being modelled.  Assuming that it
02600	has  been  so  judged, discrepancies in the comparison reveal what is
02700	not understood and must be modified in the model. After modifications
02800	are  carried  out,  a  fresh  comparison  is  made  with  the natural
02900	counterpart and we repeatedly cycle through this procedure attempting
03000	to gain convergence.
03100	
03200		Once  a  simulation  model  reaches  a  stage  of   intuitive
03300	adequacy,  a  model  builder  should  consider  using  more stringent
03400	evaluation procedures relevant to the model's purposes. For  example,
03500	if the model is to serve as a training device, then a simple
03600	evaluation of its pedagogic effectiveness would be sufficient.  But
03700	when the model is proposed as an explanation of a psychological
03800	process, more is demanded of the evaluation procedure. In the area of
03900	simulation  models  Turing's  test  has  often  been  suggested  as a
04000	validation procedure.
04100		It  is  very easy to become confused about Turing's Test.  In
04200	part this is due to Turing  himself  who  introduced  the  now-famous
04300	imitation   game   in   a  paper  entitled  COMPUTING  MACHINERY  AND
04400	INTELLIGENCE (Turing, 1950).  A careful reading of this paper reveals
04500	there are actually two imitation games, the second of which is
04600	commonly called Turing's test.
04700		In the first imitation game  two  groups  of  judges  try  to
04800	determine which of two interviewees is a woman. Communication between
04900	judge and  interviewee  is  by  teletype.  Each  judge  is  initially
05000	informed  that  one  of the interviewees is a woman and one a man who
05100	will pretend to be a woman. After the interview, the judge  is  asked
05200	what we shall call the woman-question, i.e. which interviewee was the
05300	woman?  Turing does not say what else  the  judge  is  told  but  one
05400	assumes  the  judge is NOT told that a computer is involved nor is he
05500	asked to determine which  interviewee  is  human  and  which  is  the
05600	computer.  Thus,  the  first  group  of  judges  would  interview two
05700	interviewees:    a woman, and a man pretending to be a woman.
05800		The  second  group  of judges would be given the same initial
05900	instructions, but unbeknownst to them, the two interviewees would  be
06000	a  woman  and a computer programmed to imitate a woman.   Both groups
06100	of judges  play  this  game  until  sufficient  statistical  data are
06200	collected  to  show  how  often the right identification is made. The
06300	crucial question then is:  do the judges decide wrongly AS OFTEN when
06400	the  game  is  played  with man and woman as when it is played with a
06500	computer substituted for the man?  If so, then the program is
06600	considered to have imitated a woman as successfully as a man
06700	imitating a woman.  For emphasis we repeat: in asking the
06800	woman-question  in  this  game,  judges  are not required to identify
06900	which interviewee is human and which is machine.
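
The comparison of the two groups of judges can be sketched in a few lines of Python. Turing does not name a statistical procedure, so the pooled two-proportion z statistic used here is an assumption, as are the function name `misidentification_gap` and the counts, which are invented purely for illustration.

```python
import math

def misidentification_gap(wrong_man, total_man, wrong_machine, total_machine):
    """Compare how often judges answer the woman-question wrongly in each game.

    Returns the error rate when a man plays the imitator, the error rate when
    a computer plays the imitator, and a pooled two-proportion z statistic
    (an assumed choice of test, not one specified by Turing).
    """
    p_man = wrong_man / total_man
    p_machine = wrong_machine / total_machine
    pooled = (wrong_man + wrong_machine) / (total_man + total_machine)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_man + 1 / total_machine))
    z = (p_machine - p_man) / se if se > 0 else 0.0
    return p_man, p_machine, z

# Hypothetical counts: 40 judgments in each group.
p_man, p_machine, z = misidentification_gap(12, 40, 10, 40)
print(f"wrong with man: {p_man:.2f}, wrong with computer: {p_machine:.2f}, z = {z:.2f}")
```

A z value near zero would mean the judges are fooled about as often by the program as by the man, which is the outcome the first game treats as success.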
07000		Later  on  in  his  paper  Turing proposes a variation of the
07100	first game. In the second game, one interviewee is a man and one is a
07200	computer.   The judge is asked to determine which is man and which is
07300	machine, which we shall call the machine-question. It is this version
07400	of  the game which is commonly thought of as Turing's test.    It has
07500	often been suggested as a means of validating computer simulations of
07600	psychological processes.
07700		In  the  course of testing our simulation  of paranoid
07800	linguistic behavior in a psychiatric interview, we conducted a number
07900	of Turing-like indistinguishability tests (Colby, Hilf, Weber and
08000	Kraemer, 1972). We say `Turing-like' because none of them consisted of
08100	playing  the  two  games  described above. We chose not to play these
08200	games for a number of reasons which can be summarized by saying  that
08300	they  do  not  meet modern criteria for good experimental design.  In
08400	designing our tests we were primarily  interested  in  learning  more
08500	about   developing   the  model.   We  did  not  believe  the  simple
08600	machine-question to be  a  useful  one  in  serving  the  purpose  of
08700	progressively   increasing  the  credibility  of  the  model  but  we
08800	investigated a variation of it to satisfy the curiosity of colleagues
08900	in artificial intelligence.
09000	METHOD
09100	The  experimental  arrangement  of  this  indistinguishability   test
09200	involved  the  technique  of machine-mediated interviewing [Hilf]. In
09300	this type of interview, the  participants  communicate  by  means  of
09400	teletypes  connected  through  a computer which sends "mail" back and
09500	forth between the two teletype jobs.  The sender of a  message  types
09600	it  using  his  own  words  in  natural  language.   The  message  is
09700	accumulated in a buffer and  shortly  thereafter  typed  out  on  the
09800	receiver's teletype in a rapid, regular manner, which eliminates the
09900	para- and extra-linguistic cues found in the usual vis-a-vis and
10000	teletyped interviews where the participants communicate directly.
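
As a rough illustration of the buffering idea, the following Python sketch accumulates what a sender types and hands the finished message to the other party in one piece. The class `MailRelay` and its methods are hypothetical names invented for this sketch; it is not a description of the actual system software.

```python
from collections import deque

class MailRelay:
    """Toy stand-in for the computer that passes "mail" between two teletype jobs."""

    def __init__(self):
        self._buffers = {"judge": "", "patient": ""}              # text being typed
        self._inboxes = {"judge": deque(), "patient": deque()}    # delivered messages

    def type_text(self, sender, text):
        # Keystrokes accumulate privately; hesitations never reach the receiver.
        self._buffers[sender] += text

    def backspace(self, sender, count=1):
        # Corrections are applied inside the buffer and are likewise invisible.
        if count > 0:
            self._buffers[sender] = self._buffers[sender][:-count]

    def send(self, sender):
        # Deliver the completed message to the other party as a single unit.
        receiver = "patient" if sender == "judge" else "judge"
        self._inboxes[receiver].append(self._buffers[sender])
        self._buffers[sender] = ""

    def receive(self, receiver):
        inbox = self._inboxes[receiver]
        return inbox.popleft() if inbox else None

relay = MailRelay()
relay.type_text("judge", "HOW DID YOU COMME")
relay.backspace("judge", 2)
relay.type_text("judge", "E TO BE IN THE HOSPITAL?")
relay.send("judge")
delivered = relay.receive("patient")
print(delivered)
```

The receiver sees only the final wording, so typing speed, pauses, and corrections carry no information about who or what is at the other end.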
10100	
10200	In a run of the test, using this technique, a judge  interviewed  two
10300	patients,  one after the other.  In half the runs the first interview
10400	was with a human patient and in half the first was with the  paranoid
10500	model. Two versions (weak and strong) of the model were utilized.  The
10600	strong version is more severely paranoid and  exhibits  a  delusional
10700	system  while  the  weak  version  is less severely paranoid, showing
10800	suspiciousness but lacking systematized delusions.  When the "patient"
10900	was the paranoid model, Sylvia Weber served as a monitor
11000	to check the input expressions from the judge for inadmissible
11100	teletype  characters  and  misspellings.   If  these  were found, the
11200	monitor retyped  the  input  expression  correctly  to  the  program.
11300	Otherwise  the judge's message was sent on to the model.  The monitor
11400	had no effect on the  model's  output  expressions  which  were  sent
11500	directly  back  to  the  judge.   When the patient interviewed was an
11600	actual human patient, the dialogue took place without  a  monitor  in
11700	the loop since we did not feel the asymmetry to be significant.
11800	
11900	PATIENTS
12000	The patients (N=3  with  one  patient  participating  6  times)  were
12100	diagnosed  as  paranoid  by staff psychiatrists of a locked ward in a
12200	nearby psychiatric hospital.  The patients were selected by the  head
12300	of the ward.  Two patients were set up for each run of the experiment
12400	in order to guarantee having a subject.  In spite of this precaution,
12500	the  experiment  could  not be conducted several times because of the
12600	patients' inability or refusal to participate.  Losses were also
12700	suffered  when the computer system broke down at an early point in an
12800	interview where too few I-O pairs had been collected to be included
12900	in the statistical results.
13000	
13100	The  patients were asked by their ward chief if they would be willing
13200	to participate in a study of psychiatric  interviewing  by  means  of
13300	teletypes.  It was explained that the patient would be interviewed by
13400	a psychiatrist over a teletype.  One of us (KMC) sat with the patient
13500	while  he  typed  or  typed  for  him if he was unable to do so.  The
13600	patient was encouraged to respond freely using his own  words.   Each
13700	interview lasted 30-40 minutes.
13800	
13900	JUDGES
14000	Two groups of judges were used.   One  group,  the  interview  judges
14100	(N=8) conducted interviews and another group, the protocol judges for
14200	this test (N=33) read the interview protocols.  Two groups of  judges
14300	were  used  to  see  if  the  small  number  of psychiatrists used as
14400	interview judges was representative of psychiatrists in general as
14500	far   as  their  judgements  of  "paranoia"  are  concerned,  and  to
14600	accumulate a large number of observations (in the form of ratings) in
14700	order that more acceptable confidence levels might be obtained in the
14800	statistical analysis of the data.  The interview judges consisted  of
14900	psychiatrists  experienced  in  private and/or hospital practice.  As
15000	mentioned, the concept "paranoid" is a fairly reliable  category  and
15100	identification  of  the paranoid mode is not difficult for experts to
15200	make.  The interview judges  were  obtained  from  local  psychiatric
15300	colleagues  willing to participate.  Each interview judge was told he
15400	would be interviewing hospitalized patients  by  means  of  teletyped
15500	communication  and  that  this  technique was being used to eliminate
15600	para- and extra-linguistic cues.  The interview judge was not
15700	informed  initially  that  one  of  the  patients might be a computer
15800	model.   While  the  interview  judges  were  aware  a  computer  was
15900	involved,  none  knew  we  had  constructed  a  paranoid  simulation.
16000	Naturally some interview judges suspected that a computer  was  being
16100	used for more than message transmission.
16200	
16300	Each interview judge's task was to rate the  degree  of  paranoia  he
16400	detected  in  the  patient's  responses  on a 0-9 scale, 0 meaning no
16500	paranoia and 9 meaning extreme paranoia.  The judge made two  ratings
16600	after each I-O pair in the interview.  The first rating represented his
16700	estimate of the degree of "paranoidness"  in  a  particular  response
16800	(designated  as  "Response"  in  the  interview extracts below).  The
16900	second rating represented the judge's global estimate of the  overall
17000	degree  of  "paranoidness" of the patient resulting from the totality
17100	of the patient's responses up to this point.  The  interview  judge's
17200	ratings  were  entered on the teletype and saved on a disc file along
17300	with the interview. Franklin Dennis Hilf sat  with  the  interviewing
17400	psychiatrist  during both interviews.  Each interview judge was asked
17500	not only to rate the patient's response but to give his  reasons  for
17600	these ratings.  His reasons and other comments were tape recorded.
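
For readers who wish to tabulate the two ratings, the following Python sketch pulls them out of a transcript in which each I-O pair is followed by a tag of the form {RESPONSE n PATIENT n}, as in the interview extracts. The names `Rating` and `extract_ratings` are hypothetical, and the sample text is abridged from the first extract.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches tags such as {RESPONSE 5 PATIENT 5}; parentheses are tolerated too.
RATING_TAG = re.compile(r"[{(]\s*RESPONSE\s+(\d)\s+PATIENT\s+(\d)\s*[})]")

@dataclass
class Rating:
    response: int  # paranoia detected in the single response, 0-9
    patient: int   # running overall estimate for the patient, 0-9

def extract_ratings(transcript: str) -> List[Rating]:
    """Return the (response, patient) rating pairs in the order they occur."""
    return [Rating(int(r), int(p)) for r, p in RATING_TAG.findall(transcript)]

sample = """
{1} Dr. - HOW DID YOU COME TO BE IN THE HOSPITAL?
    Pt. - I AM UPSET
      {RESPONSE 1  PATIENT 5}
{3} Dr. - IN WHAT WAY ARE YOU UPSET?
    Pt. - PEOPLE MAKE ME NERVOUS
      {RESPONSE 5 PATIENT 5}
"""
ratings = extract_ratings(sample)
print(ratings)
```

The series of `response` values is what the rank-based comparison of model and patient operates on, while the last `patient` value corresponds to the final overall rating.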
17700	
17800	We shall present a few excerpts from the interviews to  indicate  the
17900	nature  of the data collected.  The I-O pairs of the interview appear
18000	in the left column along with the judge's ratings.  The judge's tape
18100	recorded comments appear on the right.
18200	
18300	EXCERPT Nr. 1:
18400	The following interview is with the strong version of the paranoid
18500	model.
18600	
18700	     	I-O Pairs			Comments By Judge Nr. 1
18800	
18900	{1} Dr. - HOW DID YOU COME TO BE
19000	          IN THE HOSPITAL?
19100	    Pt. - I AM UPSET                I associate being upset and 
19200					    agitated more with paranoia 
19300					    going on in the present.
19400	      {RESPONSE 1  PATIENT 5}
19500	{2} Dr. - HOW LONG HAVE YOU BEEN
19600	          IN THE HOSPITAL?
19700	    Pt. - ABOUT A WEEK              An acute problem.  Again some-
19800	                                    thing I associate with 
19900					    paranoia.
20000	      {RESPONSE 1 PATIENT 5}
20100	{3} Dr. - IN WHAT WAY ARE YOU
20200	          UPSET?
20300	    Pt. - PEOPLE MAKE ME NERVOUS    Right for paranoia.  Maybe an
20400					    advanced anxiety state.
20500	      {RESPONSE 5 PATIENT 5}
20600	{4} Dr. - IN WHAT WAY DO THEY
20700	          MAKE YOU NERVOUS?
20800	    Pt. - PEOPLE GIVE ME A FUNNY   Funny look suggests an idea of
20900	          LOOK SOMETIMES            reference which is in same ball
21000					    park as paranoia. Whether there
21100					    are overt delusions or not we
21200					    still don't know.
21300	      {RESPONSE 6  PATIENT 5}
21400	{5} Dr. - IS THERE ANYTHING ABOUT  
21500	          PEOPLE IN PARTICULAR?
21600	    Pt. - CERTAIN PERSONS          An interesting clue.  We'll
21700					   have to find out what he 
21800	      {RESPONSE 2  PATIENT 6}      thinks about these people.
21900	{6} Dr. - WHO ARE THE CERTAIN 
22000	          PERSONS AND WHAT IS IT
22100	          THAT THEY DO?
22200	    Pt. - DO YOU KNOW HOW THE      Bringing up the underworld 
22300	          UNDERWORLD OPERATES?     suggests a delusional system.
22400					   We don't know whether he is
22500	      {RESPONSE 9 PATIENT 9}       being worked against or
22600					   whether the underworld is 
22700					   on his side.
22800	
22900	{7} Dr. - HOW DO YOU MEAN?
23000	    Pt. - SHOULDN'T I BE? 	   This question didn't work too well.
23100	
23200	      {RESPONSE 1 PATIENT 9}
23300	{8} Dr. - WHY DID YOU ASK ABOUT
23400	          THE UNDERWORLD?
23500	    Pt. - WHAT INFORMATION DO      This is a kind of evasiveness
23600	          YOU HAVE ON THE UNDER-   about my evasiveness.Common
23700	          WORLD?                   with paranoia in my experi-
23800					   ence. Pretty characteristic.
23900	
24000	      {RESPONSE 6 PATIENT 9}
24100	{9} Dr. - I DON'T KNOW ABOUT THE
24200	          UNDERWORLD BUT WHEN
24300	          YOU MENTIONED THE UNDER-
24400	          WORLD I THOUGHT YOU
24500	          MUST HAVE SOME CONTACT
24600	          WITH IT AND THAT'S
24700	          WHAT I WOULD LIKE TO
24800	          KNOW MORE ABOUT.
24900	    Pt. - I TRY TO AVOID THE UNDER-  This definitely feels paranoid.
25000	          WORLD.                     It has that feeling I associate
25100	             			     with twisting. Avoiding answering
25200	      {RESPONSE 9 PATIENT 9}         directly or going into detail.
25300					     Being suspicious of my ques-
25400					     tioning.I'm already involved
25500					     in the patient's delusional
25600					     system.
25700	
25800	{10} Dr. - HAVE YOU BEEN ACTIVELY
25900	           TRYING TO AVOID THE
26000	           UNDERWORLD?  
26100	     Pt. - NO ONE HAS ANY POWER      This response is a non sequitur.
26200	           OVER GANGSTERS            It's possible he's become so
26300					     defensive about the under-
26400	       {RESPONSE 2 PATIENT 9}        world he's decided not to
26500					     answer directly.
26600	
26700	{11} Dr. - WHO BROUGHT YOU INTO
26800	           THE HOSPITAL?
26900	     Pt. - THE POLICE                Police often bring paranoid
27000					     patients into the hospital.
27100	       {RESPONSE 1 PATIENT 9}        It is the truth or could 
27200					     be the truth for this
27300					     patient.
27400	
27500	{12} Dr. - WHY DID THEY BRING YOU
27600	           IN?
27700	     Pt. - COPS DON'T DO THEIR JOB   I've heard that a number of
27800	                         	     times from patients.
27900	       {RESPONSE 9 PATIENT 9}        
28000	 
28100	{13} Dr. - WELL I HOPE YOU DIDN'T
28200	           JUST LIE DOWN AND LET
28300	           THEM DRAG YOU IN.
28400	     Pt. - COPS ARREST THE WRONG     Another paranoid response.
28500	           PEOPLE
28600	
28700	       {RESPONSE 9 PATIENT 9}
28800	
28900	{14} Dr. - BYE.
29000	     Pt. - GOOD BYE
29100	       {RESPONSE 1 PATIENT 9}
29200	
29300	EXCERPT Nr. 2:
29400	This excerpt is from an interview with the weak (less paranoid)
29500	version of the model.
29600		I-O Pairs		    Comments By Judge Nr. 2
29700	{9} Dr. - TELL ME SOMETHING ABOUT
29800	          YOURSELF
29900	    Pt. - I WENT TO THE TRACK AT     Response says so little it's
30000	          BAY MEADOWS A WHILE        hard to rate it. It may be
30100	          BACK         		     guarded. Doesn't change my
30200	                                     estimate of patient from 
30300					     previous responses.
30400	      {RESPONSE 0 PATIENT 2}
30500	{10} Dr. - WHAT HAPPENED THERE?
30600	     Pt. - THERE'S A LOT OF CROOKED   Sounds mildly paranoid or un-
30700	           GAMBLING GOING ON THERE    realistically suspicious.
30800	       {RESPONSE 0 PATIENT 2}
30900	
31000	{11} Dr. - WHY DO YOU THINK THE
31100	           GAMBLING IS CROOKED?
31200	     Pt. - I HAVE WORRIED THAT         I'd hate to have to justify
31300	           GANGSTERS HAVE TOO MUCH     that one  as 4 compared to
31400	           POWER IN THIS COUNTRY       some other responses.Maybe
31500					       I'm not rating that statement
31600		{RESPONSE 4 PATIENT 4}	      others, like that it's upsetting.
31700					      Alone I would give it a 3.
31800	
31900	{12} Dr. - WOULD YOU TELL ME WHAT      
32000	           OTHER THINGS WORRY YOU?
32100	     Pt. - I THOUGHT I TOLD YOU        "Yes you did." Irritation but
32200	           PEOPLE GET ON MY NERVES     not paranoid response.Nothing
32300					       seems to lower my previous
32400	       {RESPONSE 1 PATIENT 3}          rating of patient.
32500	
32600	{13} Dr. - HOW DO YOU LIKE THE
32700	           HOSPITAL?
32800	     Pt. - I DON'T COMMUNICATE          (nothing relevant)
32900	           VERY MUCH WITH OTHERS
33000	       {RESPONSE 0 PATIENT 3}
33100	
33200	{14} Dr. - DO OTHER PEOPLE AT THE 
33300	           HOSPITAL GET ON YOUR
33400	           NERVES?
33500	     Pt. - I SHOULDN'T HAVE COME        (nothing relevant)
33600	           HERE
33700	
33800	       {RESPONSE 2 PATIENT 3}
33900	
34000	{15} Dr. - WHY DO YOU SAY YOU 
34100	           SHOULDN'T HAVE COME?
34200	     Pt. - THEY MADE ME COME HERE        "Oh, they did hum?" Maybe a
34300					         statement of fact he was
34400	       {RESPONSE 4 PATIENT 3}            committed or pressured into  
34500	   				         going.Has ring of other
34600					         people controlling him,
34700					         having a lot of influence
34800					         over him.
34900	
35000	{16} Dr. - THANK YOU FOR TALKING
35100	           WITH ME, BYE
35200	     Pt. - GOOD BYE                      In general he comes across 
35300						 as terse, guarded guy.Brief
35400	       {RESPONSE 0 PATIENT 3}            spoken guy who occasionally
35500						 seems to be angered by
35600						 questions.I'm inferring 
35700					         there's a good bit of
35800						  guarding.
35900						 The quality of some of his
36000						 comments cause me to feel he is
36100						 moderately paranoid.
36200	
36300	EXCERPT Nr. 3:
36400	This excerpt is from a hospitalized paranoid patient.
36500	     I-O Pairs				      Comments By Judge Nr. 3
36600	{15} Dr. - I UNDERSTAND HOW YOU
36700	           MIGHT BE AFRAID, CAN YOU
36800	           TELL ME A BIT ABOUT HOW
36900	           YOU ARE DEPRESSED?
37000	     Pt. - THEY'VE SENT ALL THESE         Would be important to know,
37100	           HELL'S ANGELS UP HERE          is this really true or an
37200	           AND THEY HAVE BEEN ON          exaggeration or a delusion?
37300	           MY BACK FOR THE PAST SIX       Seems preoccupied with the
37400	           MONTHS                         threat to his life and
37500						  unwilling to talk about
37600						  anything else but the 
37700						  threat the Hell's
37800						  Angels pose.
37900	      {RESPONSE 4 PATIENT 4}
38000	
38100	{16} Dr. - WHAT DO YOU MEAN THEY
38200	           SENT THE HELL'S ANGELS UP
38300	           HERE?
38400	     Pt. - FROM DIFFERENT SECTORS OF      Vagueness a bit alarming.
38500	           MOTORCYCLE GANGS. A LOT OF
38600	           THEM ARE SERVICE CONNECTED
38700	
38800	       {RESPONSE 6 PATIENT 4}
38900	
39000	{17} Dr. - YOU MEAN YOU FEEL IN 
39100	           DANGER EVEN HERE IN THE
39200	           HOSPITAL?
39300	     Pt. - YES      			  If it's true what has gone
39400						  on before, it must be a
39500						  frightening situation
39600						  for the patient.
39700	      {RESPONSE 6 PATIENT 4}
39800	
39900	{18} Dr. - THAT MUST REALLY BE AWFUL,  
40000	           HERE YOU COME TO THE HOSPITAL
40100	           FOR HELP AND YOU ARE STILL FRIGHTENED
40200	     Pt. - I CAN'T EVEN TRUST MY
40300	           DOCTOR ANYMORE                 Some heightening of feeling
40400						  this man is paranoid.
40500						  Genuinely frightened and 
40600	      {RESPONSE 6 PATIENT 5}             can count on no one for
40700			       			    help.
40800	
40900	{19} Dr. - WHAT WOULD BE OF HELP
41000	           TO YOU NOW?
41100	     Pt. - I'M GETTING SLEEPY NOW         Patient has organized a
41200	           AND HAVE TO GO TO BED.         system of people who will
41300	           THE NUMBER ONE GUY ALWAYS      harm him.The remark about
41400						  letters in the last name 
41500	           HAS FOUR AND FIVE LETTERS      gives some credence to 
41600	           IN HIS LAST NAME               idea this is delusional
41700						  system.
41800	
41900	        {RESPONSE 7 PATIENT 7}
42000	
42100	EXCERPT Nr. 4:
42200	This excerpt is from an interview by a different judge with the same
42300	patient as in excerpt 3.
42400	      I-O Pairs				Comments By Judge Nr. 4
42500	{1} Dr. - COULD YOU TELL ME A
42600	          LITTLE BIT ABOUT YOUR 
42700	          FEAR OF DYING AND IS THIS
42800	          SOMETHING YOU HAVE FELT
42900	          IN THE PAST?
43000	    Pt. - I HAVE FELT THIS FOR AL-      He responded appropriately.
43100	          MOST A YEAR                   Response is concrete.He
43200						doesn't tell me much
43300	     {RESPONSE 7 PATIENT 9}             about his fear of dying.
43400						 Somewhat guarded response.
43500	
43600	{2} Dr. - COULD YOU TELL ME A 
43700	          LITTLE MORE ABOUT YOUR
43800	          FEAR OF DYING AND HAVE
43900	          YOU ANY THOUGHTS ABOUT
44000	          HOW IT IS GOING TO HAPPEN
44100	          AND WHETHER ANYONE WANTS
44200	          TO HURT YOU?
44300	    Pt. - I WILL END UP IN THE SEA   	Somewhat strange  response.
44400	          AND THE HELL'S ANGELS WILL    Preoccupation with dying.
44500	          DO IT       			Hell's Angels symbolic,
44600						dangerous people in the
44700						culture.I doubt if he has
44800	      {RESPONSE 9 PATIENT 9}   		any contact with the Hell's
44900						Angels.
45000	
45100	{3} Dr. - COULD YOU TELL ME A 
45200	          LITTLE BIT ABOUT THE KINDS
45300	          OF CONTACT YOU HAVE HAD
45400	          WITH THE HELL'S ANGELS IN
45500	          THE PAST?
45600	    Pt. - I HAVE KNOWN SOME OF THEIR 	Answer hard to evaluate.He
45700	          DEALERS AND PUSHERS           may be telling the truth,
45800						it may be his fantasy.Maybe
45900						guy is in for drug addiction.
46000		{RESPONSE 6 PATIENT 9}		Somewhat concrete, guarded,
46100						and frightened.
46200	
46300	{4} Dr. - COULD YOU SAY A LITTLE
46400	          MORE ABOUT THE CIRCUMSTANCES
46500	          IN WHICH YOU HAVE KNOWN SOME 
46600	          OF THEIR DEALERS AND PUSHERS?
46700	    Pt. - THEY WERE MEMBERS OF MY    	It doesn't really answer the
46800	          COMMUNITY WHEN I GOT OUT      question, a little on a tan-
46900	          OF THE SERVICE THEY HAD       gent unconnected to the
47000	          BEEN MY FRIENDS FOR SO LONG   information I am asking.Does
47100						not tell me very much.Again
47200						guarded response.
47300	      {RESPONSE 6 PATIENT 8}
47400	
47500	{5} Dr. - DID YOU DEAL WITH THEM
47600	          YOURSELF AND HAVE YOU
47700	          BEEN ON DRUGS OR NAR-
47800	          COTICS EITHER NOW OR
47900	          IN THE PAST?
48000	    Pt. - YES I HAVE IN THE PAST     	To differentiate him from
48100	          BEEN ON MARIHUANA REDS        previous patient, at least
48200	          BENNIES LSD       		there is a certain amount
48300						of appropriateness to the
48400						answer although it doesn't
48500						tell me much about what I
48600	       {RESPONSE 3 PATIENT 7}		asked at least it's not
48700						bizarre.If I had him in my
48800						 office I would feel con-
48900						fident I could get more
49000						information if I didn't
49100						have to go through the
49200						teletype. He's a little more
49300						willing to talk than the
49400						 previous person.Answer
49500						to the question is fairly
49600						appropriate though not 
49700						extensive.Much less of a 
49800						flavor of paranoia than
49900						any of previous responses.
50000	
50100	{6} Dr. - COULD YOU TELL ME HOW      	
50200	          LONG YOU HAVE BEEN IN THE
50300	          HOSPITAL AND SOMETHING
50400	          ABOUT THE CIRCUMSTANCES
50500	          THAT BROUGHT YOU HERE?
50600	    Pt. - CLOSE TO A YEAR AND		Response somewhat appropriate 
50700	          PARANOIA BROUGHT ME 		but doesn't tell me much.
50800	          HERE				The fact that he uses the
50900						word paranoia in the way
51000						 that he does without
51100	      {RESPONSE 5 PATIENT 7}		any other information,
51200						indicates maybe it's a label 
51300						he picked up on the ward 
51400	                                        or from his doctor.
51500						Lack of any kind of under-
51600						standing about  himself.
51700						Dearth, lack of information.
51800						He's in some remission.Seems
51900						somewhat like a put-on.Seems
52000						he was paranoid and is in 
52100						some remission at this time.
52200	
52300	{7} Dr. - COULD YOU SAY SOMETHING
52400	          NOW ABOUT YOUR PARANOID 
52500	          FEELINGS BOTH AT THE 
52600	          TIME OF ADMISSION AND
52700	          DO YOU HAVE SIMILAR FEELINGS
52800	          NOW AND IF SO HOW DO THEY 
52900	          AFFECT YOU?
53000	    Pt. - AT THE TIME OF ADMISSION	This response moves paranoia 
53100	          I THOUGHT THE MAFIA WAS  	back up. Stretching reality 
53200	          AFTER ME AND NOW ITS THE	somewhat to think Hell's Angels 
53300	          HELL'S ANGELS			are still interested in him.
53400						Somewhat bizarre in terms of 
53401	                                        content. Quite paranoid.
53500	      {RESPONSE 8 PATIENT 9}		Still paranoid.Gross and primitive
53600						responses.In middle of interview I
53700						felt patient was in touch but now
53800						responses have more concrete aspect
53900	
54000	{8} Dr. - DO YOU HAVE ANY THOUGHT
54100	          AS TO WHY THESE TWO
54200	          GROUPS WERE AFTER YOU?
54300	    Pt. - BECAUSE I STOPPED SOME 	Response seems far fetched 
54400	          OF THEIR DRUG SUPPLY		and hard to believe unless 
54500						he was a narcotic agent which 
54501						I doubt. Sounds somewhat 
54600	      {RESPONSE 9 PATIENT 9}		grandiose, magical, paranoid
54700						flavor, in general indicates 
54800						he's psychotic, paranoid 
54900						schizophrenic with delusions  
55000						about these two groups and 
55001						I wouldn't rule out
55100						some hallucinations as well.
55200						Appropriateness of response 
55300						answers question in concrete 
55400						but unbelievable way.
55500	
55600	
55700		The protocol judges were  selected  from  the  1970  American
55800	Psychiatric Association Directory, using a table of random numbers to
55900	select 105 names.  The protocol judges in this group were
56000	not informed that a computer was involved.  Each of the 105
56100	psychiatrists was sent transcripts of three interviews along with a
56200	cover  letter  requesting  participation  in  the  experiment.    The
56300	interview transcripts consisted of:
56400		1) An interview conducted by one of the eight judges with the
56500		   paranoid model,
56600		2) An interview conducted by the same interview judge with a
56700		   human paranoid patient, and
56800		3) An interview conducted by an independent psychiatrist of a
56900		   human patient who was not clinically paranoid.
57000	
57100		The 105 names were divided into eight groups, each member of
57200	which received transcripts of the two interviews performed by one of
57300	the eight interview judges.  The transcripts were printed so that after
57400	each input-output pair there were two lines of rating numbers, so
57500	that the protocol judges could circle numbers corresponding to their
57600	ratings of both the preceding response of the patient and an overall
57700	evaluation of the patient with regard to the paranoid continuum.
57800	Thirty-three protocol judges (a good response rate for psychiatric
57900	questionnaires)  returned the rated protocols properly filled out and
58000	all were used in our data.
58100	
58200		The  interviews  with  nonparanoid  patients were included to
58300	control for the  hypothesis  that  any  teletyped  interview  with  a
58400	patient  might  be  judged  "paranoid".   Since  virtually all of the
58500	ratings of the nonparanoid interviews were 0 for paranoia, the
58600	hypothesis was falsified.
58700	
58800	
58900	RESULTS
59000		The first index of resemblance examined was  the  simple  one
59100	defined  by the final overall rating given the patient and the model:
59200	which was rated as being more paranoid, the patient,  the  model,  or
59300	neither?  (See Table 1.)  The protocol judges were more likely than the
59400	interview judges to distinguish the overall paranoid level of the model
59500	and the patient.  The interview judges gave tied scores to the model and
59600	the patient in 37.5% of the paired interviews, as contrasted with only
59700	9% for the protocol judges.  In 15 of the 35 non-tied paired ratings the
59800	model was rated as more paranoid.  If p is the theoretical probability
59900	of a judge judging the model more paranoid than a human paranoid
60000	patient, we find the 95% confidence interval for p to be .27 to .59.
60100	Since p=.5 indicates indistinguishability of model and patient overall
60200	ratings, and this value lies within the interval around our observed
60300	p=.43, the results support the claim that the model is a good
60350	simulation of a paranoid patient.
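
The arithmetic behind these figures can be checked with a short Python sketch. The text does not say how the interval was computed; the normal-approximation interval used below is an assumption, and the function name `proportion_interval` is invented for illustration. The counts are consistent with 3 ties among the 8 interview judges (37.5%) and 3 among the 33 protocol judges (9%), leaving 35 non-tied ratings out of 41.

```python
import math

def proportion_interval(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion.

    z=1.96 gives a two-sided 95% interval. This is one common method and is
    assumed here; the source does not state which method was used.
    """
    p_hat = successes / trials
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, p_hat - half, p_hat + half

# 15 of the 35 non-tied paired ratings placed the model above the patient.
p_hat, low, high = proportion_interval(15, 35)
print(f"p = {p_hat:.3f}, 95% interval = {low:.3f} to {high:.3f}")
print("contains .5:", low < 0.5 < high)
```

Under this assumption the interval agrees with the stated .27 to .59 to within rounding, and it contains .5, the value corresponding to indistinguishability.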
60400	
60500	Separate analysis of the strong and weak  versions  of  the  paranoid
60600	model  indicates that indeed the strong model is judged more paranoid
60700	than the patients, the weak version less paranoid.  Thus a change  in
60800	the parameter structure of the paranoid model produces a change along
60900	the dimension of paranoid behavior in the expected direction.
61000	
61100	TABLE 1
61200	Relative final overall ratings of paranoid model vs. paranoid patient
61300	indicating which was given highest overall rating of paranoia at end 
61400	of interview.
61500	INSERT TABLE 1
61600	
61700	
61800	
61900	
62000	
62100	
62200	
62300	
62400	END OF TABLE 1
62500	
62600	The  second index of resemblance is a more sensitive measure based on
62700	the two series of response ratings in  the  paired  interviews.   The
62800	statistic  used  is basically the standardized Mann-Whitney statistic
62900	[Siegel].
63000			INSERT EQUATION
63100	
63200	where R is the sum of the ranks of the response ratings in the series
63300	of ratings given to the model, n the number of responses given by the
63400	model,  m  the  number  of  responses  given  by the patient.  If the
63500	ratings given by a judge are randomly allocated to model and patient,
63600	i.e. model and patient are indistinguishable in response ratings, the
63700	expected value of Z is 0, with unit standard  deviation.   If  higher
63800	ratings  are  more  likely to be assigned to the model, Z is positive
63900	and, conversely, negative values of Z indicate greater likelihood  of
64000	assigning  higher  ratings to the patient. Each judge in evaluating a
64100	pair of interviews generates a single value of Z.
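The equation itself is not reproduced in this copy. The sketch below assumes the usual standardized rank-sum form that matches the definitions just given, Z = (R - n(n+m+1)/2) / sqrt(nm(n+m+1)/12), with tied ratings given the average of their ranks and no tie correction in the variance. The function names and the sample ratings are illustrative only, not data from the experiment.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1           # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def standardized_rank_sum(model_ratings, patient_ratings):
    """Z for one judge and one pair of interviews (assumed form, see lead-in)."""
    n, m = len(model_ratings), len(patient_ratings)
    ranks = average_ranks(list(model_ratings) + list(patient_ratings))
    r = sum(ranks[:n])                        # R: rank sum of the model's ratings
    expected = n * (n + m + 1) / 2
    sd = sqrt(n * m * (n + m + 1) / 12)       # no tie correction applied
    return (r - expected) / sd


# Hypothetical ratings for a single judge, not taken from the study.
z = standardized_rank_sum([5, 6, 7, 4], [1, 2, 3, 4])
print(f"Z = {z:.2f}")
```

A positive result means the model's responses tended to receive the higher ratings; swapping the two series flips the sign.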
64200	
64300	The overall mean of the Z scores was +.044 with a standard deviation
64400	of 1.68 (df=40). Thus the overall 95% confidence interval for the
64500	asymptotic mean value of Z is -.485 to +.573. The range of Z values
64600	is -3.8 to +4.46. The length of the confidence interval is  a  result
64700	of  the large variance which itself is mainly related to the contrast
64800	between the weak and strong versions.  (See TABLES 2  and  3).   Once
64900	again  the  strong  version  of  the  model is more paranoid than the
65000	patients, the weak version less paranoid.
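The width of this interval can be checked from the reported standard deviation. The sketch below assumes a Student's t interval on 41 Z scores (inferred from df=40) and takes the two-sided 95% critical value for 40 degrees of freedom, 2.021, from standard tables; the function name is ours.

```python
from math import sqrt

# Two-sided 95% critical value of Student's t for 40 degrees of freedom,
# taken from standard tables (an assumption about the method used).
T_CRIT_DF40 = 2.021


def mean_ci_half_width(sd, n, t_crit):
    """Half-width of a t-based confidence interval for a mean."""
    return t_crit * sd / sqrt(n)


half_width = mean_ci_half_width(sd=1.68, n=41, t_crit=T_CRIT_DF40)
print(f"half-width = {half_width:.2f}")
```

The result, about .53, agrees with half the distance between the reported limits of -.485 and +.573.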
65100	
65200	TABLE 2
65300	Summary statistics of Z ratings by group
65400		In this design eight psychiatrists  interviewed  by  teletype
65500		INSERT TABLE 2
65600	
65700	
65800	
65900	
66000	
66100	
66200	
66300	
66400	
66500		END OF TABLE 2
66600	All judges (both interview and protocol) who evaluated the same  pair
66700	of  interviews are referred to as a "group".  Strong groups evaluated
66800	strong versions of the paranoid model, while  weak  groups  evaluated
66900	weak versions of the model.
67000	
67100	It  is  not  surprising  that  results  using  the  two  indices   of
67200	resemblance  are parallel, since the indices are highly interrelated.
67300	The mean Z value was +1.28 for the 15 interviews in which the model was
67400	rated more paranoid, .41 for the 6 in which model and patient tied, and
67500	-.993 for the 20 in which the patient was rated more paranoid. A
67600	positive value of Z was observed 6 times when the patient was given an
67700	overall rating greater than the model; a negative value of Z was
67800	observed twice when the model was rated more paranoid.
67900	
68000	TABLE 3
68100	Analysis of Variance of Z Ratings
68200	INSERT TABLE 3
68300	
68400	
68500	
68600	
68700	
68800	
68900	
69000	
69100	
69200	END OF TABLE 3
69300	
69500	
69600	
69700	DISCUSSION
69800		The results of this experiment indicate our simulation of
69900	paranoid processes to be successful relative to the
70000	indistinguishability tests utilized. Thus it is an acceptable
70100	simulation as measured by the standard proposed.
70200	
70300		It is worth emphasizing that our test invited refutation of
70400	the model. The experimental design of the tests put the model in
70500	jeopardy of falsification. If the paranoid model did not survive
70600	these tests, i.e. if it were not considered paranoid by expert
70700	judges, if there were no correlation between the weak-strong versions
70800	of the model and the severity ratings of the judges, and if the judges
70900	could distinguish actual patient interviews from
71000	computer program interviews, then no claim regarding the success of
71100	the simulation could be made. Survival of a falsification procedure
71200	constitutes a validating step.
71300	
71400		It is of some historical significance that these  experiments
71500	were  conducted  at all. To my knowledge no one to date has subjected
71600	his  model  of   human   mental   processes   to   such   challenging
71700	indistinguishability tests.  Other competing models are needed in the
71800	field of psychopathology.  These tests set a precedent and provide  a
71900	standard  for  competing  models to be measured against.  The general
72000	area of computer simulation of mental processes needs not only better
72100	models but better tests and statistical measures of resemblance.  The
72200	problems of appropriate critical experimental  designs  and  measures
72300	provide a promising frontier for future work.
72400	6.2 THE MACHINE QUESTION
72500		As mentioned (p. 00),  we  conducted  an  experiment  on  the
72600	machine  out of curiosity. For hundreds of years humans have wondered
72700	how to distinguish a man from an imitation. To distinguish a man from
72800	a statue Galileo suggested tickling each with a feather. To
72900	distinguish a man from a machine Descartes  suggested  conversational
73000	tests. Turing's proposals have been discussed on p.00.
73100		To ask the machine-question, we sent  interview  transcripts,
73200	one  with a patient and one with PARRY, to 100 psychiatrists randomly
73300	selected from the Directory of American Specialists and the Directory
73400	of the American Psychiatric Association. Of the 41 replies, 21 (51%)
73500	made the correct identification while 20 (49%) were wrong. Based on
73600	this random sample of 41 psychiatrists, the 95% confidence interval
73700	is 35.9% to 66.5%, a range that includes the chance level of 50%.
73800		Psychiatrists   are   considered  expert  judges  of  patient
73900	interview behavior but they are unfamiliar with computers.  Hence  we
74000	conducted  the  same  test  with  100  computer  scientists  randomly
74100	selected from the membership list of the  Association  for  Computing
74200	Machinery (ACM). Of the 67 replies, 32 (48%) were right and 35 (52%)
74300	were wrong. Based on this random sample of 67 computer scientists the
74400	95% confidence interval ranges from 36% to 60%, again close to a chance level.
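The two intervals quoted for the machine-question can be reproduced with the simple normal-approximation interval for a proportion. The text does not name the method used, so this is an assumption; the function name wald_interval is ours.

```python
from math import sqrt


def wald_interval(correct, replies, z=1.96):
    """Normal-approximation 95% interval for a proportion (assumed method)."""
    p_hat = correct / replies
    half_width = z * sqrt(p_hat * (1 - p_hat) / replies)
    return p_hat - half_width, p_hat + half_width


# Reply counts as reported in the text.
for label, correct, replies in [("psychiatrists", 21, 41),
                                ("computer scientists", 32, 67)]:
    low, high = wald_interval(correct, replies)
    print(f"{label}: {100 * low:.1f}% to {100 * high:.1f}%")
```

Both intervals include 50%, which is what "close to chance" amounts to here.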
74500		Thus the answer to this machine-question, "Can expert judges,
74600	psychiatrists and computer scientists, using teletyped transcripts
74700	of psychiatric interviews, distinguish between paranoid patients and
74800	a simulation of paranoid processes?", is "No". Turing predicted in
74900	1950: "I believe that in about fifty years' time it will be possible
75000	to programme computers, with a storage capacity of about 10^9, to
75100	make them play the imitation game so well that an average
75200	interrogator will not have more than 70 percent chance of making the
75300	right identification after five minutes of questioning." In 1972, 22
75400	years after Turing's prediction and allowing interviewers 20-40 I/O
75500	pairs (a better measure than real time), our model played a version
75600	of the imitation game well enough that expert judges had only about a
75700	50 percent chance of making the right identification.
75800		But what do we learn from asking the machine-question and
75900	finding that the answer is "no"? It is some comfort that the answer
76000	was not "yes" and the null hypothesis (no differences) failed to be
76100	rejected, especially since statistical tests are somewhat biased in
76200	favor of rejecting the null hypothesis (Meehl, 1967). Yet this answer
76300	does not tell us what we would most like to know, i.e. how to
76400	improve the model. Simulation models do not spring forth in a
76500	complete, perfect and final form; they must be gradually developed
76600	over time. Perhaps we might obtain a "yes" answer to the
76700	machine-question if we allowed a large number of expert judges to
76800	conduct the interviews themselves rather than studying transcripts of
76900	other interviewers. A "yes" answer would indicate that the model must be
77000	improved  but  unless  we  systematically investigated how the judges
77100	succeeded in making the discrimination we would not know what aspects
77200	of  the  model to work on. The logistics of such a design are immense
77300	and obtaining a large N of judges  for  sound  statistical  inference
77400	would require an effort disproportionate to the information-yield.
77500	6.3 MULTIDIMENSIONAL EVALUATION
77600		A more efficient and informative way to use Turing-like tests
77700	is to ask judges to make ordinal ratings along scaled dimensions from
77800	teletyped  interviews.     We  shall  term  this  approach asking the
77900	dimension-question.   One can then compare scaled ratings received by
78000	the patients and by the model to precisely determine where and by how
78100	much they differ.        Model builders  strive  for  a  model  which
78200	shows     indistinguishability     along    some    dimensions    and
78300	distinguishability along others. That is, the model converges on what
78400	it is supposed to simulate and diverges from that which it is not.
78500		We  mailed  paired-interview  transcripts  to   another   400
78600	randomly  selected psychiatrists asking them to rate the responses of
78700	the two `patients' along certain dimensions. The judges were  divided
78800	into groups, each judge being asked to rate responses of each I/O
78900	pair in the interviews along four dimensions. The total number of
79000	dimensions in this test was twelve: linguistic noncomprehension,
79100	thought disorder, organic brain syndrome, bizarreness,  anger,  fear,
79200	ideas  of  reference, delusions, mistrust, depression, suspiciousness
79300	and mania. These are dimensions which psychiatrists commonly  use  in
79400	evaluating patients.
79500			(INSERT TABLE 4 HERE)
79600		Table 4 shows there were significant differences, with  PARRY
79700	receiving   higher   scores   along   the  dimensions  of  linguistic
79800	noncomprehension, thought disorder, bizarreness, anger, mistrust and
79900	suspiciousness. On the dimension of delusions the patients were rated
80000	significantly higher. There were no significant differences along the
80100	dimensions of organic brain syndrome, fear, ideas of reference,
80200	depression and mania.
80300		Whereas tests   asking    the    machine-question    indicate
80400	indistinguishability  at  the  gross  level,  a study of the finer or
80500	micro-structure of the model's behavior through ratings along  scaled
80600	dimensions   shows   statistically  significant  differences  between
80700	patients and model.     These differences are of help  to  the  model
80800	builder in suggesting which aspects of the model must be modified and
80900	improved in order to be considered  an  adequate  simulation  of  the
81000	class of paranoid patients it is intended to simulate.  It is evident
81100	from the graph of Fig.000 that no modifications are necessary along
81200	the dimension of "organic brain syndrome". But it is clear that the
81300	model's language-comprehension must be improved. Once this  has  been
81400	implemented,  a  future  test  will  tell  us whether improvement has
81500	occurred and by how  much  in  comparison  to  the  earlier  version.
81600	Successive  identification  of particular areas of failure provides a
81700	type of sensitivity analysis which makes clear what improvements must
81800	be pursued in developing more adequate model versions.
81900		Further evidence that the machine-question is too coarse  and
82000	insensitive  a test comes from the following experiment. In this test
82100	we constructed a random version of the paranoid model which  utilized
82200	PARRY's output statements but expressed them randomly no matter what
82300	the interviewer said.   Two psychiatrists conducted  interviews  with
82400	this  model, transcripts of which were paired with patient interviews
82500	and sent to 200  randomly  selected  psychiatrists  asking  both  the
82600	machine-question  and  the dimension-question.  Of the 69 replies, 34
82700	(49%) were right and 35 (51%) wrong. Based on this random  sample  of
82800	69 psychiatrists, the 95% confidence interval ranges from 39  to  63,
82900	again  indicating a chance level. When a poor model, such as a random
83000	one, passes a test, it suggests the test is weak. In the  case  where
83100	even a random model cannot be distinguished, we can conclude that asking
83200	the simple machine-question does not offer a severe enough challenge.
83300	Although  a  distinction  is not made when "which is the machine?" is
83400	asked, definite distinctions ARE made when judgements  are  requested
83500	along   specific   dimensions.  As  shown  in  Table  5,  significant
83600	differences   appear   along    the    dimensions    of    linguistic
83700	noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
83800	rated higher.  On these particular  dimensions  we  can  construct  a
83900	continuum  in  which  the  random version represents one extreme, the
84000	actual patients another. Our (nonrandom) PARRY lies somewhere between
84100	these  two extremes, indicating that it performs significantly better
84200	than the random version but still requires improvement  before  being
84300	indistinguishable from patients. (See Fig.1-graph.) Table 6 presents t
84400	values for differences between mean ratings of PARRY and
84500	RANDOM-PARRY. (See Table 5 and Fig.1 for the mean ratings.)
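To make concrete what a t value for a difference between mean ratings measures, here is a minimal sketch. The text does not say which form of the t test was used, so the unequal-variance (Welch) form is an assumption, and the ratings below are invented for illustration; they are not the values behind Tables 5 and 6.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(ratings_a, ratings_b):
    """t statistic for the difference between two mean ratings
    (unequal-variance form; an assumed choice, see lead-in)."""
    standard_error = sqrt(variance(ratings_a) / len(ratings_a)
                          + variance(ratings_b) / len(ratings_b))
    return (mean(ratings_a) - mean(ratings_b)) / standard_error


# Hypothetical ratings on one dimension, e.g. linguistic noncomprehension.
random_parry_ratings = [7, 8, 6, 9, 7, 8]
parry_ratings = [4, 5, 3, 5, 4, 6]

t_value = welch_t(random_parry_ratings, parry_ratings)
print(f"t = {t_value:.2f}")
```

A large positive t on a dimension would place the first set of ratings further toward the random extreme of the continuum described above.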
84600		Thus it can be seen that  such a multidimensional evaluation
84700	provides  yardsticks  for measuring the adequacy of this or any other
84800	dialogue simulation model along the relevant dimensions.
84900		We conclude that when model builders want to conduct tests of
85000	adequacy which indicate in  which  direction  progress  lies  and  to
85100	obtain  a  measure  of whether progress is being achieved, the way to
85200	use Turing-like tests is to ask expert judges to make  ratings  along
85300	multiple dimensions that are essential to the model.   Thus the model
85400	can serve as an instrument for its own perfection. A good  validation
85500	procedure has criteria for better or worse approximations. Useful
85600	tests do not prove a model, they  probe  it  for  its  strengths  and
85700	weaknesses  and  clarify  what  is  to  be done next in modifying and
85800	repairing the model. Simply asking the machine-question yields little
85900	information  relevant  to  what the model builder most wants to know,
86000	namely,  along  what  dimensions  must  the  model  be  repaired  and
86100	improved.
86200	
86300